In this project my goal is to take the raw data concerning Illinois state prison bookings and analyze its attributes. The analysis is framed as though it were being presented to a group of social workers, with the possible aim of developing criminal profiles and determining courses of action that would help prevent future bookings.
This data came from the Illinois State government website: https://data.illinois.gov. The link to the exact data can be found in the bibliography. It is the prison booking data for the entire state of Illinois from January 1, 2012 to July 1, 2018, and it contains roughly 67 thousand observations. It is a list of every person who was booked in the prison system during that time, why they were booked, the exact dates and times of the bookings, and information about the inmate. The data has 38 attributes, including date-time, numerical, nominal, and ordinal data. I will be examining attributes such as age at the time of arrest, race, employment status, marital status, level of offense, and type of crime committed. There will also be a time series analysis of the number of bookings per day to determine whether there are any trends or patterns in the booking data.
It is really important to note that this is not a roster of who is currently in prison in the state of Illinois. There are inmates who are in the prison system who were booked outside of the scope of this data. In addition, this does not contain information revealing the name of the person being booked, nor whether they are a repeat offender. It is possible, if not likely, that there are duplicates in this data in reference to the person being booked.
When looking at the visualizations for this data, keep the above in mind. It means that when we look at something like race, the plot does not represent the proportion of current inmates who are a certain race; it represents the number of times a person of that race was booked.
The first step was to examine the data and look for irregularities. There were a total of 924 NaN (null) values, which were removed from the data before continuing. When first exploring the categorical attributes, I used a function to examine all of the unique values for each attribute. Each categorical attribute contained an empty string as one of its categories. It can be assumed that this occurs when the person booking the inmate did not have information on the inmate for that category. When I removed every observation containing an empty string, I was left with only 2,000 observations. Since this took away such a large portion of my data set, I decided to remove them ad hoc: I would only remove empty strings from a category when I was working with that category, in order to preserve most of the data.
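As a rough illustration of the difference between the two cleaning strategies, here is a minimal sketch. It is written in Python with only the standard library rather than the R used for this analysis, and the rows, column names, and function names are hypothetical stand-ins, not the real data or code.

```python
from collections import Counter

# Hypothetical toy rows; the real data has 38 attributes and other column names.
bookings = [
    {"race": "White", "employment": "Unemployed", "marital": "Single"},
    {"race": "", "employment": "Employed", "marital": "Single"},
    {"race": "Black", "employment": "", "marital": ""},
    {"race": "Hispanic", "employment": "Unemployed", "marital": "Married"},
]

def empty_counts(rows):
    """Count empty-string values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] += 1
    return dict(counts)

def drop_empty_everywhere(rows):
    """Keep only rows with no empty string in any column (the costly option)."""
    return [row for row in rows if all(value != "" for value in row.values())]

def drop_empty_for(rows, column):
    """Keep rows whose value in `column` is non-empty (the ad hoc option)."""
    return [row for row in rows if row[column] != ""]

print(empty_counts(bookings))
print(len(drop_empty_everywhere(bookings)))   # rows surviving a global drop
print(len(drop_empty_for(bookings, "race")))  # rows surviving a per-column drop
```

Filtering per column keeps every row that is usable for the attribute currently being plotted, which is why far more observations survive than with a global drop.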
A note on the time series data: the time series required extra data cleaning and feature engineering. I had to manipulate the data into a data frame that contained each unique day and the booking count for that day. The date data had to be converted from a string to date-time using the lubridate package. Once I had a new data frame with these two values, I extracted the month, year, and day of the week into three more columns.
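The reshaping described above can be sketched roughly as follows. This is a Python standard-library illustration, not the lubridate-based R code used in the analysis; the timestamp format, the sample values, and the function name are assumptions.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical booking timestamps as strings; the real format may differ.
raw_booking_dates = [
    "2012-01-01 08:15:00",
    "2012-01-01 23:40:00",
    "2012-01-02 11:05:00",
    "2012-02-14 02:30:00",
]

def daily_booking_table(date_strings, fmt="%Y-%m-%d %H:%M:%S"):
    """Return one record per unique day with its booking count and date parts."""
    # Parse each string and keep only the calendar date.
    days = [datetime.strptime(text, fmt).date() for text in date_strings]
    counts = Counter(days)
    return [
        {
            "date": day,
            "bookings": counts[day],
            "month": day.month,
            "year": day.year,
            "weekday": WEEKDAY_NAMES[day.weekday()],
        }
        for day in sorted(counts)
    ]

for record in daily_booking_table(raw_booking_dates):
    print(record)
```

Each record holds the two base values (date and count) plus the three derived columns, which is the shape needed for the monthly, yearly, and day-of-week summaries later on.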
The four main numerical values I had to work with were age at arrest, age at release, days in jail, and hours in jail. I decided to work the most with age at arrest and days in jail. Ultimately, days in jail ended up being far too skewed to work with reasonably.
Below are the initial visualizations for age at arrest. The mean age at arrest was 30.77, or roughly 31, with a standard deviation of 11.14 years (the raw data row of the sampling summary table). The distribution has a strong right skew, with the highest peaks of the histogram being people in their mid 20s. I then broke this information down by race in both histogram and boxplot form.
When I first broke this down, I wanted to see if there was a difference in distribution when grouped by race. It was interesting to discover that all of the distributions remained similar to the overall distribution. The only exception was Native American, but I do not feel this is a fair comparison, since there were far fewer bookings from that population than from other races. In keeping with the law of large numbers, it is possible that if there were more Native Americans in the data, their distribution would show a skew similar to that of other races.
The box plots below show the five number summary of the age at the time of arrest for each race. Although there is some variation, each group is centered in the mid 20s to early 30s age range. This is consistent with the overall average age at arrest of 31.
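For readers unfamiliar with the five number summary behind a box plot, here is a minimal sketch in Python (standard library only, not the R used for the plots). The group labels and ages are made up, and quartile conventions differ slightly between tools, so the numbers a plotting library reports may not match this method exactly.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (group, age_at_arrest) pairs, not values from the real data.
records = [
    ("A", 19), ("A", 24), ("A", 27), ("A", 33), ("A", 52),
    ("B", 21), ("B", 25), ("B", 30), ("B", 41), ("B", 60),
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

def summary_by_group(pairs):
    """Group values by label and summarize each group separately."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {group: five_number_summary(vals) for group, vals in groups.items()}

for group, summary in summary_by_group(records).items():
    print(group, summary)
```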
The central limit theorem states that if a population has mean (mu) and standard deviation (sigma), and you take “sufficiently large random samples” with replacement, then the distribution of the sample means will be approximately bell shaped (LaMorte & Sullivan, 2016). This theorem is important because it applies even to data that has a pronounced tail or skew, like ours. Two features of this data make it well suited to the central limit theorem. The first is that the data contains the entire population: since it is from a government website with meticulous records, we have the age of every single person each time they were booked in prison, so this is not a sample of the population. The second is the number of entries the data contains, roughly 67,000. Since our data is heavily skewed, “the samples may need to be much larger for the distribution of the sample means to begin to show a bell shape” (Kerns, 2010). Luckily, we have plenty of data to pull from. Continuing to use the age at arrest attribute, I have applied the central limit theorem below. I worked with 5000 samples and sample sizes of 10, 20, 30, and 40.
Now using 4 different sample sizes (10, 20, 30, 40) with 1,000 samples each, let’s examine the shape of the bell curves. The larger the sample size, the taller and thinner the bell curve becomes, reflecting a smaller standard deviation of the sample means.
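The narrowing of the bell curve can be checked with a small simulation. The sketch below is in Python (standard library only) rather than R, and it uses a synthetic right-skewed population as a stand-in for age at arrest, so the printed numbers are illustrative only. It compares the observed spread of the sample means with the value the central limit theorem predicts, sigma divided by the square root of the sample size.

```python
import random
from statistics import mean, pstdev

def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples samples with replacement and return their means."""
    return [
        mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed "ages": 17 plus an exponential tail, not the real data.
population = [17 + rng.expovariate(1 / 14) for _ in range(10000)]
sigma = pstdev(population)

spread_by_size = {}
for size in (10, 20, 30, 40):
    means = sample_means(population, size, 1000, rng)
    spread_by_size[size] = pstdev(means)
    # Columns: sample size, mean of sample means, observed spread, sigma / sqrt(n)
    print(size, round(mean(means), 2), round(spread_by_size[size], 2),
          round(sigma / size ** 0.5, 2))
```

The observed spread should track sigma / sqrt(n) closely, which is why each increase in sample size produces a taller, thinner histogram of sample means.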
I used three different sampling methods to explore this data: simple random sampling without replacement (SRSWOR), systematic sampling (unequal probabilities), and stratified sampling. Based on the look of the histograms, I feel that stratified sampling (still using race as the categorical variable) was the best method.
Simple random sampling reflected the base data, with a strong right skew. Systematic sampling with unequal probabilities was chosen to take into account the weight of different observations. It was a little more even than SRSWOR. Stratified sampling looked the closest to a normal distribution. I feel this is because it takes into account all of the different major categories and pulls proportional samples from each. I used sample sizes of 100 for this.
##   Sample.Type       Mu     Sigma
## 1    Raw Data 30.77503 11.141540
## 2      SRSWOR 31.88000 10.916839
## 3  Systematic 34.03000 11.323931
## 4  Stratified 31.93939  9.898277
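To make the three sampling ideas concrete, here is a minimal Python sketch (standard library only, not the R code behind the table above). Note one deliberate simplification: the systematic sampler shown is the plain equal-probability version, not the unequal-probability variant used in the analysis. The rows, group labels, and function names are hypothetical.

```python
import random
from collections import defaultdict

def srswor(rows, n, rng):
    """Simple random sample without replacement."""
    return rng.sample(rows, n)

def systematic(rows, n, rng):
    """Equal-probability systematic sample: random start, then every k-th row."""
    k = len(rows) // n
    start = rng.randrange(k)
    return [rows[start + i * k] for i in range(n)]

def stratified(rows, n, key, rng):
    """Proportional allocation: each stratum contributes about n * its share."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Rounding per stratum means the total can differ slightly from n.
        take = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(take, len(members))))
    return sample

rng = random.Random(7)
# Hypothetical (group, age) rows with group sizes 600 / 300 / 100.
rows = [("A", rng.randint(17, 70)) for _ in range(600)]
rows += [("B", rng.randint(17, 70)) for _ in range(300)]
rows += [("C", rng.randint(17, 70)) for _ in range(100)]

simple = srswor(rows, 100, rng)
every_kth = systematic(rows, 100, rng)
by_group = stratified(rows, 100, key=lambda row: row[0], rng=rng)
print(len(simple), len(every_kth), len(by_group))
```

Because the stratified sampler fixes each group's share of the sample, small groups are always represented in proportion, whereas a simple random sample can over- or under-represent them by chance.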
When thinking about presenting this to a group of social workers, I wanted to build a profile of who is being booked in prison. Ultimately, this could help create a plan for how to help individuals who are being booked. I want to point out that I am going to try to be delicate and sensitive with my inferences in this section. Although we can answer questions like who or what, we cannot conclude why.
My main hypothesis in exploring this data is that individuals who have less stability are more likely to be booked.
This pie chart is simple and dominated by one category, but that one category tells us a lot: 95.8% of people who were booked in the Illinois prison system did not have a military background. Since the military is known for providing discipline and stability, this is consistent with the theory that individuals with less stability are more likely to be booked.
The data below is organized by three different categories.
I included these two plots because I feel they are interesting in terms of proportions. I do not want to use them to make any inferences, though, because they are part of a larger social issue that is sensitive. Although they are important to look at and note, no solid inferences about race and sex can be made from these plots.
Again, not too many inferences can be made about men or women based on this, but we can make a few observations.
I found this one very important. The tallest bar, by a considerable amount, is the group of inmate bookings who are considered unemployed. In addition, these inmate bookings consisted of more felonies than misdemeanors.
Looking at this plot, we again see a strong presence of people who were unemployed at the time of booking.
Looking at the chart below, single people make up the dominant share of bookings.
Next, we take a look at the top 20 crimes that were committed between 2012 and 2018 in the State of Illinois. The data has been manipulated to drop categories like “other” and “miscellaneous jail code”.
Some inferences that can be made from this are:
1. Petty crimes are some of the most common crimes, with traffic or driving related crimes predominating.
2. Drug and theft offenses were also common in Illinois during the years in question.
3. DUIs are the 3rd most common reason for arrest. This is easily preventable, and it could be worthwhile to look into successful alcohol intervention or mitigation programs.
I created a time series of the bookings per day in this data set. As described in the data cleaning section, the booking date column was taken from the original data set and cast as a date-time data type. Using this single column I was able to feature engineer several other attributes. In order to manipulate the data the way I wanted, I also put the final data frame into a tibble. I used the fpp2 time series package to make adjustments to the data, such as differencing, and to visualize seasonality. One recurring pattern I was able to identify is that there is always a drop in prison bookings during the month of December. I used an ARIMA model to help with forecasting future trends.
When first looking at a time series plot of day-by-day bookings, there was a lot of noise. I ended up dividing it up by monthly average and year. The first subplot shows the overall linear trend from year to year for average monthly bookings. There is a visible steady decline in the data.
This is another visualization that surprised me. I expected the number of daily bookings to be higher on Friday and Saturday. Monday has the lowest number of daily bookings, but not by much. It is interesting that average daily bookings were fairly steady regardless of the day of the week.
As mentioned above, the ARIMA model was used to forecast future booking data. ARIMA stands for Autoregressive Integrated Moving Average. The ARIMA model is popular because it can take into account seasonality, differencing, autoregression, and moving averages. The autoregression portion takes into account “past values of the variable”. What makes the ARIMA model sophisticated, though, is the moving average portion of the model, which “uses past forecast errors in a regression-like model” (Hyndman & Athanasopoulos, 2018).
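Two of these components can be illustrated in a few lines. The sketch below is in Python (standard library only) and is not how the fpp2 tooling fits an ARIMA model: it only shows first-order differencing (the "integrated" part) and a least-squares fit of a first-order autoregression, and it leaves out the moving average terms and the order selection entirely. The series values and function names are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; lag=1 removes a linear trend."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def fit_ar1(series):
    """Least-squares estimate of (c, phi) in y[t] = c + phi * y[t-1] + error."""
    x = series[:-1]  # past values
    y = series[1:]   # values one step later
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    phi = sxy / sxx
    c = y_bar - phi * x_bar
    return c, phi

# Hypothetical monthly averages with a steady decline (not the real data).
monthly = [40.0, 38.5, 37.0, 35.5, 34.0]
print(difference(monthly))  # each step is -1.5, so the trend is gone

# A series built exactly from y[t] = 2 + 0.5 * y[t-1], starting at 10.
ar_series = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(ar_series)
print(round(c, 3), round(phi, 3))
```

Differencing turns a steadily declining series into a roughly constant one, which is what allows the autoregressive and moving average parts to model the remaining structure.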
Unfortunately, in the last forecasting plot, the blue area indicates a wide prediction interval, which makes the forecast unreliable. This makes sense, as future prison bookings are inherently difficult to predict.
Although a proper set of hypothesis testing was not conducted for this, I feel like there is enough here to continue developing further. I think there is strong enough initial evidence to suggest that an unstable personal life creates a higher likelihood of being booked in prison.
Some theories I would like to investigate further that were not available in this data are:
What is the rate of people booked who have offended before?
What is the post high school education of the persons being booked? (e.g. trade school, some college, etc.)
What is the rate of people who commit alcohol or drug related crimes and also commit other crimes? Can this be mitigated?
There is no single correct answer. This subject involves deep-rooted social constructs and sensitive topics. I do feel that with further investigation, research, and community outreach, the number of bookings can continue to decline.
Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice (2nd ed.). OTexts: Melbourne, Australia. OTexts.com/fpp2. Retrieved February 15, 2022.
Kerns, G.J. (2010). Introduction to Probability and Statistics Using R. Kerns.
LaMorte, W., & Sullivan, L. (2016). The Role of Probability. Boston University School of Public Health, Page 12. https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/index.html
State of Illinois. (2019). Illinois.gov open data sourcing. Retrieved February 7, 2022, from https://data.illinois.gov/dataset/jail-booking-data